An Algorithm for Approximate Tandem Repeats
Authors
Abstract
A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k, our algorithm reports all locally optimal approximate repeats, r = ū û, for which the Hamming distance of ū and û is at most k, in O(nk log (n/k)) time, or all those for which the edit distance of ū and û is at most k, in O(nk log k log (n/k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u^a u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that conform to our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log (n/k)), where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm. The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to human diseases.
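The definitions in the abstract can be made concrete with a small brute-force sketch. This is not the paper's algorithm and makes no attempt at the stated time bounds; it only checks whether a given string satisfies each definition. All function names are hypothetical, and the rule that both halves must be nonempty in the edit-distance case is an assumption, not something the abstract states.

```python
def hamming(x: str, y: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Classic dynamic-programming edit distance (unit-cost insert/delete/substitute)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


def is_single_repeat_hamming(r: str, k: int) -> bool:
    """True if r splits into two equal-length halves with at most k mismatches."""
    if not r or len(r) % 2:
        return False
    half = len(r) // 2
    return hamming(r[:half], r[half:]) <= k


def is_single_repeat_edit(r: str, k: int) -> bool:
    """True if some split of r into two nonempty parts has edit distance <= k.

    Assumption: both parts must be nonempty; the halves may differ in length.
    """
    return any(edit_distance(r[:i], r[i:]) <= k for i in range(1, len(r)))


def is_exact_multiple_repeat(r: str, period: int) -> bool:
    """True if r = u^a u' with |u| = period and u' a prefix of u,
    i.e. every character equals the one `period` positions earlier."""
    if not 0 < period <= len(r):
        return False
    return all(r[i] == r[i - period] for i in range(period, len(r)))


print(is_single_repeat_hamming("abcabc", 0))      # perfect repeat from the abstract
print(is_single_repeat_hamming("abcdaacd", 1))    # approximate repeat from the abstract
print(is_exact_multiple_repeat("abcabcab", 3))    # u = abc, a = 2, u' = ab
```

For the abstract's example abcdaacd, the halves abcd and aacd differ only in the second character, so the string qualifies as an approximate single repeat for any k of at least 1.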
Similar resources
Detection of Significant Patterns by Compression Algorithms: the Case of Approximate Tandem Repeats in DNA Sequences. Rivals
We use compression algorithms to analyse genetic sequences. The basic idea is that a compression algorithm is associated with a property. The more a sequence is compressed by the algorithm, the more significant is the property for that sequence. Here we present an algorithm to detect a particular type of dosDNA (Defined Ordered Sequence-DN...
STAR: an algorithm to Search for Tandem Approximate Repeats
MOTIVATION Tandem repeats consist in approximate and adjacent repetitions of a DNA motif. Such repeats account for large portions of eukaryotic genomes and have also been found in other life kingdoms. Owing to their polymorphism, tandem repeats have proven useful in genome cartography, forensic and population studies, etc. Nevertheless, they are not systematically detected nor annotated in geno...
Finding the Position of the k-Mismatch and Approximate Tandem Repeats
Given a pattern P, a text T, and an integer k, we want to find for every position j of T, the index of the k-mismatch of P with the suffix of T starting at position j. We give an algorithm that finds the exact index for each j, and algorithms that approximate it. We use these algorithms to get an efficient solution for an approximate version of the tandem repeats problem with k-mismatches.
COMPUTATION AND ANALYSIS: A thesis presented in partial fulfilment of the requirements
Biological sequences have long been known to contain many classes of repeats. The most studied repetitive structure is the tandem repeat where many approximate copies of a common segment (the motif ) appear consecutively. In this thesis, a complex repetitive structure is investigated. This repetitive structure is called a nested tandem repeat. It consists of many approximate copies of two motif...
Tandem repeats over the edit distance
MOTIVATION A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repea...
Journal title:
- Journal of computational biology : a journal of computational molecular cell biology
Volume 8, Issue 1
Pages: -
Publication date: 1993